[SPARK-44647][SQL] Support SPJ where join keys are less than cluster keys #42306

szehon-ho · 2023-08-02T22:43:16Z

What changes were proposed in this pull request?

- Add new conf spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled
- Change key compatibility checks in EnsureRequirements. Remove the checks that require all partition keys to be in the join keys, so that isKeyCompatible = true in this case (if this flag is enabled).
- Change BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys (previously grouped only by partition values). Do the same for all auxiliary data structures, like commonPartValues.
- Implement partiallyClustered skew-handling.
  - Group only the replicate side (now by join key as well), replicate by the total size of other-side partitions that share the join key.
  - Add an additional sort for partitions based on join key, because when we group the replicate side, partition ordering becomes out of order from the non-replicate side.

Why are the changes needed?

Support Storage Partition Join in cases where the join condition does not contain all the partition keys, but just some of them

Does this PR introduce any user-facing change?

No

How was this patch tested?

- Added tests in KeyGroupedPartitioningSuite
- Found two existing problems, will address in separate PR:
  - Because of [SPARK-40429][SQL] Only set KeyGroupedPartitioning when the referenced column is in the output #37886, we have to select all join keys to trigger SPJ in this case; otherwise the DSV2 scan does not report KeyGroupedPartitioning and SPJ does not get triggered. Need to see how to relax this.
  - https://issues.apache.org/jira/browse/SPARK-44641 was found when testing this change. This PR refactors some of that code to add group-by-join-key, but doesn't change the underlying logic, so the issue continues to exist. Hopefully this will also get fixed in another way.

sunchao · 2023-08-08T22:59:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+            // Support only when all cluster key have an associated partition expression key
+            requiredClustering.exists(x => attributes.exists(_.semanticEquals(x))) &&
+              // and if all partition expression contain only a single partition key.
+               expressions.forall(_.collectLeaves().size == 1)


hmm why this condition?

This was to fix a test, though to be honest I couldn't find it again. There was a test somewhere that was trying this case (which isn't actually supported in the code currently) and, I think, asserting the right exception, which I think would break if SPJ is activated. I could revert this and look again to find the test, if you want.

sunchao · 2023-08-16T20:09:29Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala

-              groupSplits = true).get
+            // In the case where we replicate partitions, we have grouped
+            // the partitions by the join key if they differ
+            val groupByExpressions =


Can we override KeyGroupedPartitioning method in this class, and wrap the logic of handling join keys in the method? We can return a new KeyGroupedPartitioning instance whose expressions, partitionValues are "projected" on the join keys.

Done, changed outputPartitioning to return a KeyGroupedPartitioning that reflects that.
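The "projection" on join keys discussed in this thread can be illustrated with a minimal, self-contained sketch. It does not use Spark's classes: `Split`, `project`, `lexLess` and `groupByJoinKey` are hypothetical names introduced only for illustration, and partition values are modeled as plain integers. It shows the general idea of keeping only the partition values at the join key positions, grouping splits by that projected key, and sorting the groups so both sides of a join would see them in the same order.

```scala
// Hypothetical, simplified model; these are not Spark's actual classes.
object GroupByJoinKeySketch {
  // One input split, tagged with its full tuple of partition values.
  final case class Split(partitionValues: Seq[Int], sizeInBytes: Long)

  // Keep only the partition values at the positions that take part in the join.
  def project(values: Seq[Int], joinKeyPositions: Seq[Int]): Seq[Int] =
    joinKeyPositions.map(i => values(i))

  // Lexicographic "less than" on projected keys, written by hand to avoid
  // depending on version-specific implicit orderings.
  def lexLess(a: Seq[Int], b: Seq[Int]): Boolean =
    a.zip(b)
      .collectFirst { case (x, y) if x != y => x < y }
      .getOrElse(a.length < b.length)

  // Group splits by their projected key, then sort the groups by that key.
  def groupByJoinKey(
      splits: Seq[Split],
      joinKeyPositions: Seq[Int]): Seq[(Seq[Int], Seq[Split])] =
    splits
      .groupBy(s => project(s.partitionValues, joinKeyPositions))
      .toSeq
      .sortWith((l, r) => lexLess(l._1, r._1))
}
```

For example, with splits partitioned by two keys and a join on the first key only, `groupByJoinKey(splits, Seq(0))` merges all splits that share the first partition value into one group.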

sunchao · 2023-08-16T20:18:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

-      node.mapChildren(child => populatePartitionValues(
-        child, values, applyPartialClustering, replicatePartitions))
+      node.mapChildren(child => populateStoragePartitionJoinParams(
+        child, values, partitionGroupByPositions, applyPartialClustering, replicatePartitions))


Instead of populating partitionGroupByPositions, can we populate StoragePartitionJoinParams.keyGroupedPartitioning instead? It can be the subset of expressions that participate in the join.

I will need some more guidance on this one.

szehon-ho · 2023-08-24T22:40:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+
+object KeyGroupedShuffleSpec {
+
+  def isExpressionCompatible(left: Expression, right: Expression): Boolean =


This just groups the new static method into a companion object, so the diff looks a bit bigger. Let me know if I should revert.

dongjoon-hyun · 2023-08-28T18:12:36Z

Could you re-trigger the failed pipeline or rebase this PR to the master branch once more, @szehon-ho ?

szehon-ho · 2023-08-28T18:21:28Z

Hi @dongjoon-hyun, I think @sunchao has another idea he is thinking about; I was going to wait a bit for that before updating the PR.

dongjoon-hyun · 2023-08-28T18:29:13Z

Oh, got it!

sunchao

Thanks @szehon-ho ! Looks great with a few comments.

sunchao · 2023-08-25T15:56:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+            if (SQLConf.get.getConf(
+              SQLConf.V2_BUCKETING_ALLOW_JOIN_KEYS_SUBSET_OF_PARTITION_KEYS)) {
+              requiredClustering.exists(x => attributes.exists(_.semanticEquals(x))) &&
+                expressions.forall(_.collectLeaves().size == 1)


this deserves some comments since otherwise it's a bit confusing why we need it.

Added some comments; please check if they make sense.

sunchao · 2023-09-07T16:38:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+            if (SQLConf.get.getConf(
+              SQLConf.V2_BUCKETING_ALLOW_JOIN_KEYS_SUBSET_OF_PARTITION_KEYS)) {
+              requiredClustering.forall(x => attributes.exists(_.semanticEquals(x))) &&
+                  expressions.forall(_.collectLeaves().size == 1)


this should be guaranteed currently - it might be better to have this invariant check somewhere else like when constructing a KeyGroupedPartitioning, but OK to leave it here for now
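The two revisions of this condition quoted in the thread differ in whether `requiredClustering` is checked with `exists` or `forall`. The following sketch, which is only a toy model and not Spark's `Expression` API, shows how the two variants behave. `Expr`, `Col`, `Transform` and `satisfies` are hypothetical names, `leaves` stands in for `collectLeaves()`, and column names are compared as strings instead of using `semanticEquals`.

```scala
// Hypothetical toy model of partition expressions; not Spark's Expression API.
object KeyCompatibilitySketch {
  sealed trait Expr { def leaves: Seq[String] }
  final case class Col(name: String) extends Expr {
    def leaves: Seq[String] = Seq(name)
  }
  final case class Transform(name: String, args: Seq[Expr]) extends Expr {
    def leaves: Seq[String] = args.flatMap(_.leaves)
  }

  // requireAll = false mirrors the `exists` variant: at least one join key
  // must be a partition column. requireAll = true mirrors the `forall`
  // variant: every join key must be a partition column.
  def satisfies(
      requiredClustering: Seq[String],
      partitionExprs: Seq[Expr],
      requireAll: Boolean): Boolean = {
    val attributes = partitionExprs.flatMap(_.leaves)
    val covered =
      if (requireAll) requiredClustering.forall(k => attributes.contains(k))
      else requiredClustering.exists(k => attributes.contains(k))
    // Each partition expression may reference only a single column.
    covered && partitionExprs.forall(_.leaves.size == 1)
  }
}
```

With partition expressions on columns `a` and `b` and join keys `a` and `c`, the `exists` variant accepts the join while the `forall` variant rejects it; a transform over two columns is rejected by both.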

sunchao · 2023-09-07T16:39:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

@@ -674,7 +711,8 @@ case class HashShuffleSpec(

 case class KeyGroupedShuffleSpec(
    partitioning: KeyGroupedPartitioning,
-    distribution: ClusteredDistribution) extends ShuffleSpec {
+    distribution: ClusteredDistribution,
+    joinKeyPositions: Option[Seq[Int]] = None) extends ShuffleSpec {


We can add some comments to KeyGroupedShuffleSpec to explain what this is for; otherwise it's a bit hard to understand.

Added comments, please check and suggest if it can be improved.
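As a rough illustration of what a `joinKeyPositions: Option[Seq[Int]]` value could hold, here is a small standalone sketch. It is an assumption about the intent rather than the code in this PR: `joinKeyPositions` below is a hypothetical helper, keys are modeled as column-name strings, and returning `None` when no projection is needed is a choice made for this example to match the `= None` default in the diff.

```scala
// Hypothetical helper; the names and the None convention are assumptions.
object JoinKeyPositionsSketch {
  // Returns the indices of the partition keys that also appear in the join
  // keys, or None when there is nothing to project: either every partition
  // key is a join key, or there is no overlap at all.
  def joinKeyPositions(
      partitionKeys: Seq[String],
      joinKeys: Seq[String]): Option[Seq[Int]] = {
    val positions = partitionKeys.zipWithIndex.collect {
      case (key, index) if joinKeys.contains(key) => index
    }
    if (positions.isEmpty || positions.length == partitionKeys.length) None
    else Some(positions)
  }
}
```

For a table partitioned by `id`, `dt` and `region` and a join on `id` and `region`, this yields positions 0 and 2, which a caller could then use to project partition values before grouping.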

szehon-ho · 2023-09-08T06:19:20Z

@sunchao thanks! addressed review comments

…keys

### What changes were proposed in this pull request?
- Add new conf spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled
- Change key compatibility checks in EnsureRequirements. Remove checks where all partition keys must be in join keys to allow isKeyCompatible = true in this case (if this flag is enabled)
- "Project" partitions by join keys in KeyGroupedPartitioning/KeyGroupedShuffleSpec
- Add join key grouping to the partition grouping in BatchScanExec

### Why are the changes needed?
- Support Storage Partition Join in cases where the join condition does not contain all the partition keys, but just some of them

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Added tests in KeyGroupedPartitioningSuite
- Because of apache#37886 we have to select all join keys to trigger SPJ in this case, otherwise DSV2 scan does not report KeyGroupedPartitioning and SPJ does not get triggered. Need to see how to relax this in separate PR.

sunchao

LGTM

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2023-09-11T18:19:28Z

Merged to master for Apache Spark 4.0.0. Thank you so much, @szehon-ho and @sunchao!

sunchao · 2023-09-11T18:23:10Z

Thanks @szehon-ho @dongjoon-hyun !

irsath · 2023-11-19T22:55:56Z

Hi @dongjoon-hyun @sunchao,
Do you see any blocker to backporting this to Spark 3.5?
I think it would be useful for many use cases (including mine) that partition by date for GDPR purposes but still need SPJ on the other partitioning column.

dongjoon-hyun · 2023-11-19T23:00:17Z

Apache Spark has a backporting policy which allows only bug fixes, @irsath. Given that this PR is an improvement, we are unable to touch release branches like branch-3.5 for it.

irsath · 2023-11-20T16:38:36Z

Right, sorry for the typo, but I meant: what if we make a 3.6 with this PR?

I have never contributed to OSS Spark, but if you're OK with the idea I can try to do a PR in that regard.

…keys

- Add new conf spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled
- Change key compatibility checks in EnsureRequirements. Remove checks where all partition keys must be in join keys to allow isKeyCompatible = true in this case (if this flag is enabled)
- Change BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys (previously grouped only by partition values). Do same for all auxiliary data structure, like commonPartValues.
- Implement partiallyClustered skew-handling.
  - Group only the replicate side (now by join key as well), replicate by the total size of other-side partitions that share the join key.
  - add an additional sort for partitions based on join key, as when we group the replicate side, partition ordering becomes out of order from the non-replicate side.

- Support Storage Partition Join in cases where the join condition does not contain all the partition keys, but just some of them

No

- Added tests in KeyGroupedPartitioningSuite
- Found two existing problems, will address in separate PR:
  - Because of apache#37886 we have to select all join keys to trigger SPJ in this case, otherwise DSV2 scan does not report KeyGroupedPartitioning and SPJ does not get triggered. Need to see how to relax this.
  - https://issues.apache.org/jira/browse/SPARK-44641 was found when testing this change. This pr refactors some of those code to add group-by-join-key, but doesnt change the underlying logic, so issue continues to exist. Hopefully this will also get fixed in another way.

Closes apache#42306 from szehon-ho/spj_attempt_master.

Authored-by: Szehon Ho <szehon.apache@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

github-actions bot added the SQL label Aug 2, 2023

szehon-ho changed the title ~~[SQL][SPARK-44647] Support SPJ where join keys are less than cluster keys~~ [SPARK-44647][SQL] Support SPJ where join keys are less than cluster keys Aug 2, 2023

szehon-ho force-pushed the spj_attempt_master branch from ccaa7a7 to 80a1ecd Compare August 8, 2023 18:35

szehon-ho mentioned this pull request Aug 15, 2023

[SPARK-41471][SQL] Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning #42194

Closed

sunchao reviewed Aug 16, 2023

szehon-ho force-pushed the spj_attempt_master branch 3 times, most recently from 059824b to 62fa5dd Compare August 24, 2023 22:34

szehon-ho commented Aug 24, 2023

szehon-ho force-pushed the spj_attempt_master branch from 62fa5dd to 8319e43 Compare August 24, 2023 22:46

szehon-ho force-pushed the spj_attempt_master branch 2 times, most recently from 048701b to dac34ec Compare September 5, 2023 12:52

sunchao reviewed Sep 7, 2023

szehon-ho force-pushed the spj_attempt_master branch 2 times, most recently from fe4920d to 3fd3a7b Compare September 8, 2023 06:18

szehon-ho added 2 commits September 8, 2023 23:15

Review comments

0df6e97

szehon-ho force-pushed the spj_attempt_master branch from 3fd3a7b to 0df6e97 Compare September 8, 2023 15:16

sunchao approved these changes Sep 8, 2023

dongjoon-hyun reviewed Sep 8, 2023

dongjoon-hyun reviewed Sep 8, 2023

Review comments

a62e32b

szehon-ho force-pushed the spj_attempt_master branch from 69235f9 to a62e32b Compare September 9, 2023 01:38

szehon-ho added 2 commits September 9, 2023 17:32

Fix typo

6a7ca35

Fix sqlconf test

e832652

dongjoon-hyun approved these changes Sep 11, 2023

dongjoon-hyun closed this in 9520087 Sep 11, 2023

szehon-ho mentioned this pull request May 3, 2024

[SPARK-48065][SQL] SPJ: allowJoinKeysSubsetOfPartitionKeys is too strict #46325

Closed
